Abstract

A phylogenetic network is a directed acyclic graph that visualises an evolutionary
history containing so-called reticulations such as recombinations, hybridisations or lateral gene 
transfers. Here we consider the construction of a simplest possible phylogenetic network con- 
sistent with an input set T, where T contains at least one phylogenetic tree on three leaves (a 
triplet) for each combination of three taxa. To quantify the complexity of a network we consider 
both the total number of reticulations and the number of reticulations per biconnected com- 
ponent, called the level of the network. We give polynomial-time algorithms for constructing a 
level-1, respectively a level-2, network that contains a minimum number of reticulations and is
consistent with T (if such a network exists). In addition, we show that if T is precisely equal 
to the set of triplets consistent with some network, then we can construct such a network with 
smallest possible level in time O(|T|^{k+1}), if k is a fixed upper bound on the level of the network.

1 Introduction 

One of the ultimate goals in computational biology is to create methods that can reconstruct evolution- 
ary histories from biological data of currently living organisms. The immense complexity of biological 
evolution makes this task almost a hopeless one [20]. This has motivated researchers to focus first on 
the simplest possible pattern of evolution. This least complicated shape of an evolutionary history is 
the tree-shape. Now that treelike evolution has been extremely well studied, a logical next step is to 
consider slightly more complicated evolutionary scenarios, gradually extending the complexity that 
our models can describe. At the same time we also wish to take into account the parsimony principle, 
which tells us that amongst all equally good explanations of our data, one prefers the simplest one 
(see e.g. [11]). 

For a set of taxa (e.g. species or strains), a phylogenetic tree describes (a hypothesis of) the evolution 
that these taxa have undergone. The taxa form the leaves of the tree while the internal vertices rep- 
resent events of genetic divergence: one incoming branch splits into two (or more) outgoing branches. 

Phylogenetic networks form an extension to this model where it is also possible that two branches 
combine into one new branch. We call such an event a reticulation, which can model any kind of 
non-treelike (also called "reticulate") evolutionary process such as recombination, hybridisation or 
lateral gene transfer. In addition, reticulations can also be used to display different possible (treelike) 
evolutions in one figure. In recent years there has emerged enormous interest in phylogenetic networks 
and their application [3] [12] [18] [20] [21]. 

This model of a phylogenetic network allows for many different degrees of complexity, ranging from 
networks that are equal, or almost equal, to a tree to unrealistically complex webs of frequently di- 
verging and recombining lineages. Therefore we consider two different measures for the complexity of 
a network. The first of these measures is the total number of reticulations in the network. Secondly, 
we consider the level of the network, which is an upper bound on the number of reticulations per 
biconnected component of the network. Informally, the level of a network is a bound on the number of
reticulations that can be mutually dependent. In this paper we consider two different approaches for 
constructing networks that are as simple as possible. The first approach minimises the total number 
of reticulations for a fixed level (of at most two) and the second approach minimises (under certain 
special restrictions on the input) the level while the total number of reticulations is unrestricted. 

Level-k phylogenetic networks were first introduced by Choy et al. [7] and further studied by
different authors [14] [15] [17]. Gusfield et al. gave a biological justification for level-1 networks (which
they call "galled trees") [9]. Minimising reticulations has been very well studied in the framework
where the input consists of (binary) sequences [10] [11] [23] [24]. For example, Wang et al. considered
the problem of finding a "perfect phylogeny" with a minimum number of reticulations and showed
that this problem is NP-hard [25]. Gusfield et al. showed that this problem can be solved in polynomial
time if restricted to level-1 networks [9].

There are also several results known already about the version of the problem where the input con- 
sists of a set of trees and the objective is to construct a network that is "consistent" with each of the 
input trees. Baroni et al. give bounds on the minimum number of reticulations needed to combine two
trees [2] and Bordewich et al. showed that it is APX-hard to compute this minimum number exactly 
[4]. However, there exists an exact algorithm [5] that runs reasonably fast in many practical situations. 

In this paper we also consider input sets consisting of trees, but restrict ourselves to small trees with 
three leaves each, called triplets. See Figure 1 for an example. Triplets can for example be constructed 
by existing methods, such as Maximum Parsimony or Maximum Likelihood, that work accurately and 
fast for small numbers of taxa. Triplet-based methods have also been well-studied. Aho et al. [1] gave 
a polynomial-time algorithm to construct a tree from triplets if there exists a tree that is consistent 
with all input triplets. Jansson et al. [16] showed that the same is possible for level-1 networks if the 
input triplet set is dense, i.e. if there is a triplet for any set of three taxa. Van Iersel et al. further 
extended this result to level-2 networks [14]. From non-dense triplet sets it is NP-hard to construct 
level-k networks for any k ≥ 1 [15] [16]. It also follows directly from the proof of this result that it is
NP-hard to find a network consistent with a non-dense triplet set that contains a minimum number
of reticulations (see footnote 3). It is unknown whether this problem becomes easier if the input triplet set is dense.



Figure 1. One of the three possible triplets on the leaves x, y and z. Note that, as with all figures in this 
article, all arcs are directed downwards. 

In the first part of this paper we consider fixed-level networks and aim to minimise the total number 
of reticulations in these networks. In Section 3 we give a polynomial-time algorithm that constructs 
a level-1 network consistent with a dense triplet set T (if such a network exists) and minimises the 
total number of reticulations over all such networks. This gives an extension to the algorithm by 
Jansson et al. [16], which can also reconstruct level-1 networks, but does not minimise the number of 
reticulations. To illustrate this we give in Section 2 an example dense triplet set on n leaves for which 
the algorithm in [16] (and the ones in [14] and [17]) creates a level-1 network with (n − 1)/2 reticulations.
However, a level-1 network with just one reticulation is also possible and our algorithm MARLON is
able to find that network. We have implemented MARLON (made publicly available [19]) and tested
it on simulated data. Results are in Section 4. The worst-case running time of the algorithm is O(n^5)
for n leaves (and hence O(|T|^{5/3}) with |T| the input size).



Footnote 3: This follows from the proof of Theorem 7 in [16], since only one reticulation is used in their reduction.
In Section 5 we further extend this approach by giving an algorithm that even constructs a level-2
network consistent with a dense triplet set (if one exists) and again minimises the total number
of reticulations over all such networks. This means that if the level is at most two, we can minimise
both the level and the total number of reticulations, giving priority to the criterion that we find most
important. The running time is O(n^9) (and thus O(|T|^3)).

Minimising the level of phylogenetic networks becomes even more challenging when this level can
be larger than two, even without minimising the total number of reticulations. Given a dense set of
triplets, it is a major open problem whether one can construct a minimum level phylogenetic network
consistent with these triplets in polynomial time. Moreover, it is not even known whether it is possible
to construct a level-3 network consistent with a dense input triplet set in polynomial time. In Section 6
of this paper we show some significant progress in this direction. As a first step we consider the
restriction to "simple" networks, i.e. networks that contain just one nontrivial biconnected component. We
show how to construct, in O(|T|^{k+1}) time, a minimum level simple network with level at most k from
a dense input triplet set (for fixed k). Subsequently we show that this can be used to also generate
general level-k networks if we put an extra restriction on the quality of the input triplets. Namely,
we assume that the input set contains exactly all triplets consistent with some network. If that is the
case then our algorithm can find such a network with a smallest possible level. The algorithm runs
in polynomial time O(|T|^{k+1}) if the upper bound k on the level of the network is fixed. This result
constitutes an important step forward in the analysis of level-k networks, since it provides the first
positive result that can be used for all levels k.

2 Preliminaries 

A phylogenetic network (network for short) is defined as a directed acyclic graph in which exactly
one vertex has indegree 0 and outdegree 2 (the root) and all other vertices have either indegree 1
and outdegree 2 (split vertices), indegree 2 and outdegree 1 (reticulation vertices, or reticulations for
short) or indegree 1 and outdegree 0 (leaves), where the leaves are distinctly labelled. A phylogenetic
network without reticulations is called a phylogenetic tree.
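
To make the degree conditions concrete, the following Python sketch classifies the vertices of a small digraph by indegree and outdegree. The arc-list representation and the names classify_vertices and example_arcs are illustrative assumptions, not notation from this paper.

```python
from collections import Counter

def classify_vertices(arcs):
    """Classify each vertex of a digraph, given as (tail, head) arcs,
    by the (indegree, outdegree) pairs allowed in the definition."""
    indeg, outdeg = Counter(), Counter()
    for tail, head in arcs:
        outdeg[tail] += 1
        indeg[head] += 1
    kinds = {(0, 2): "root", (1, 2): "split",
             (2, 1): "reticulation", (1, 0): "leaf"}
    vertices = set(indeg) | set(outdeg)
    # Any other degree pair is reported as "invalid".
    return {v: kinds.get((indeg[v], outdeg[v]), "invalid") for v in vertices}

# Hypothetical example: one reticulation h with the leaf y below it.
example_arcs = [("r", "a"), ("r", "b"), ("a", "x"), ("a", "h"),
                ("b", "h"), ("b", "z"), ("h", "y")]
kinds = classify_vertices(example_arcs)
reticulations = sorted(v for v, k in kinds.items() if k == "reticulation")
```

The sketch only checks the local degree conditions; acyclicity and distinct leaf labels would have to be verified separately.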

A directed acyclic graph is connected (also called "weakly connected") if there is an undirected path
between any two vertices and biconnected if it contains no vertex whose removal disconnects the 
graph. A biconnected component of a network is a maximal biconnected subgraph and is called trivial 
if it is equal to two vertices connected by an arc. To avoid "redundant" networks we assume that in 
any network every nontrivial biconnected component has at least three outgoing arcs. We call an arc
a = (u, v) of a network N a cut-arc if its removal disconnects N and call it trivial if v is a leaf. 

Definition 1. A network is said to be a level-k network if each biconnected component contains at
most k reticulations.

A level-k network that contains no nontrivial cut-arcs and is not a level-(k − 1) network is called a
simple level-k network (see footnote 4). Informally, a simple network thus consists of a nontrivial biconnected
component with leaves "hanging" off it.

A triplet xy|z is a phylogenetic tree on the leaves x, y and z such that the lowest common
ancestor of x and y is a proper descendant of the lowest common ancestor of x and z. The triplet xy|z is
displayed in Figure 1. Denote the set of leaves in a network N by L_N. For any set T of triplets define
L(T) = ∪_{t∈T} L_t and let n = |L(T)|. A set T of triplets is called dense if for each {x, y, z} ⊆ L(T) at
least one of xy|z, xz|y and yz|x belongs to T.

For a set of triplets T and a set of leaves L' ⊆ L(T), we denote by T|L' the triplets t ∈ T with
L_t ⊆ L'. Furthermore, if C = {S_1, ..., S_q} is a collection of leaf-sets we use T∇C to denote the induced
set of triplets S_i S_j|S_k such that there exist x ∈ S_i, y ∈ S_j, z ∈ S_k with xy|z ∈ T and i, j and k all
distinct.
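
The definitions of L(T), density and the restriction T|L' can be sketched in Python as follows. Encoding a triplet xy|z as a pair (frozenset({x, y}), z) is our own assumption, as are the function names; it simply reflects that xy|z and yx|z denote the same tree.

```python
from itertools import combinations

def triplet(x, y, z):
    """Represent xy|z as (frozenset({x, y}), z)."""
    return (frozenset((x, y)), z)

def leaves(T):
    """L(T): the union of the leaf sets of all triplets in T."""
    return {leaf for pair, z in T for leaf in (*pair, z)}

def is_dense(T):
    """True if every three leaves of L(T) carry at least one triplet of T."""
    for x, y, z in combinations(sorted(leaves(T)), 3):
        if not {triplet(x, y, z), triplet(x, z, y), triplet(y, z, x)} & T:
            return False
    return True

def restrict(T, subset):
    """T|L': the triplets of T whose three leaves all lie in subset."""
    return {(pair, z) for pair, z in T if pair <= subset and z in subset}

# Hypothetical dense triplet set on the leaves w, x, y, z.
T = {triplet("x", "y", "z"), triplet("x", "y", "w"),
     triplet("x", "z", "w"), triplet("y", "z", "w")}
dense = is_dense(T)
restricted = restrict(T, {"x", "y", "z"})
```

Removing any single triplet from this example set leaves some three leaves uncovered, so the set stops being dense.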

Footnote 4: This definition is equivalent to Definition 4 in [14] by Lemma 2 in [14].
Definition 2. A triplet xy|z is consistent with a network N (interchangeably: N is consistent with
xy|z) if N contains a subdivision of xy|z, i.e. if N contains vertices u ≠ v and pairwise internally
vertex-disjoint paths u → x, u → y, v → u and v → z.
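
Definition 2 can be tested directly on very small networks by enumerating the four paths. The sketch below is a brute-force illustration under our own assumptions (an arc-list representation and the names all_paths and is_consistent); it takes exponential time in general and is not the method used by the algorithms in this paper.

```python
from itertools import combinations, product

def all_paths(arcs, src, dst):
    """Return every directed path from src to dst as a list of vertices."""
    if src == dst:
        return [[src]]
    paths = []
    for tail, head in arcs:
        if tail == src:
            paths.extend([src] + rest for rest in all_paths(arcs, head, dst))
    return paths

def is_consistent(arcs, x, y, z):
    """Brute-force test of Definition 2 for the triplet xy|z on a small DAG."""
    vertices = {v for arc in arcs for v in arc}
    for u, v in product(vertices, repeat=2):
        if u == v:
            continue
        candidates = product(all_paths(arcs, u, x), all_paths(arcs, u, y),
                             all_paths(arcs, v, u), all_paths(arcs, v, z))
        for paths in candidates:
            # Two paths may only meet in an endpoint they have in common.
            if all(set(p) & set(q) <= {p[0], p[-1]} & {q[0], q[-1]}
                   for p, q in combinations(paths, 2)):
                return True
    return False

# Hypothetical level-1 network: leaf y hangs below the reticulation h.
arcs = [("r", "a"), ("r", "b"), ("a", "x"), ("a", "h"),
        ("b", "h"), ("b", "z"), ("h", "y")]
results = {name: is_consistent(arcs, *name) for name in
           [("x", "y", "z"), ("y", "z", "x"), ("x", "z", "y")]}
```

In this example network both xy|z and yz|x are consistent, because y can be reached through either parent of h, while xz|y is not: the only common ancestor of x and z is the root, which has no proper ancestor to play the role of v.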

The above definitions enable us to give a formal description of the problems we consider. 

Problem: Minimum Reticulation Level-k network on dense triplet sets (DMRL-k).
Input: dense set of triplets T.

Output: level-k network N that is consistent with T (if such a network exists) and has a minimum
number of reticulations over all such networks.

A feasible solution to DMRL-1 can be found by the algorithm in [16] or [17] and the algorithm
in [14] finds a feasible solution to DMRL-2. To show that these algorithms do not always minimise
the number of reticulations, consider a triplet set over an odd number n of leaves, labelled 1, ..., n,
containing all triplets ab|c with a, b > c and the triplets a(a + 1)|n with a = 1, 3, ..., n − 2. The
aforementioned algorithms find for this input set a level-1 network with (n − 1)/2 reticulations. However,
a level-1 network with just one reticulation is also possible and our algorithm MARLON, introduced
shortly, is able to find that network. See Figure 2 for an example for n = 9.
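
The example triplet set can be generated mechanically. The following Python sketch builds it for a given odd n, under the assumption (ours) that a triplet xy|z is encoded as (frozenset({x, y}), z); the function name example_triplets is hypothetical.

```python
def example_triplets(n):
    """Return (base, extra): all triplets ab|c with a, b > c, and the
    triplets a(a+1)|n for a = 1, 3, ..., n - 2, for an odd n >= 3."""
    assert n % 2 == 1 and n >= 3
    base = {(frozenset((a, b)), c)
            for c in range(1, n + 1)
            for a in range(c + 1, n + 1)
            for b in range(a + 1, n + 1)}
    extra = {(frozenset((a, a + 1)), n) for a in range(1, n - 1, 2)}
    return base, extra

base, extra = example_triplets(9)
counts = (len(base), len(extra))
```

The base triplets alone describe a caterpillar tree and contribute exactly one triplet for every three leaves, so the set is dense. The extra triplets, of which there are (n − 1)/2, each conflict with the base triplet on the same three leaves, which is why no tree is consistent with the whole set.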




Figure 2. Example of a situation where previous algorithms (by Jansson et al. [16] and Van Iersel et al. [14]) 
construct a network like the one to the left with (n − 1)/2 reticulations, while MARLON constructs the network
to the right, with just one reticulation. 

Given a network N let T(N) denote the set of all triplets consistent with N. We say that a network 
N reflects a triplet set T if T(N) = T. If, for a triplet set T, there exists a network N that reflects it, 
we say that T is reflective. The second problem we consider is thus the following: 

Problem: MIN-REFLECT-k
Input: set of triplets T.

Output: level-k network N that reflects T (if such a network exists) and has the smallest possible
level over all such networks.

Note that this problem is closely related to the mixed triplets problem (MT) studied in [?], which
asks for a phylogenetic network consistent with an input triplet set T and not consistent with another
input triplet set F. Namely, MIN-REFLECT-k is a special case of MT restricted to level-k networks
where the set F of forbidden triplets contains all triplets that are not in T.

To describe our algorithms we need to introduce some more definitions. We say that a cut-arc is a 
highest cut-arc if it is not reachable from any other cut-arc. We call a cycle containing the root a 
highest cycle and a reticulation in such a cycle a highest reticulation. We say that a leaf x is be- 
low an arc (u, v) (and below vertex v) if x is reachable from v. In the next section we will frequently 
use the set BHR(N), which denotes the set of leaves in network N that are below a highest reticulation.

A subset S of the leaves is an SN-set (w.r.t. triplet set T) if there is no triplet xy|z in T with
x, z ∈ S, y ∉ S. An SN-set is called nontrivial if it does not contain all leaves. We say that an SN-set
S is maximal (under restriction X) if there is no nontrivial SN-set (satisfying restriction X) that is a 
strict superset of S. An SN-set of size 1 is called a singleton SN-set. 
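
The SN-set condition translates into a short membership test. The sketch below assumes, as our own convention, that a triplet xy|z is stored as (frozenset({x, y}), z); the name is_sn_set and the small caterpillar example are hypothetical.

```python
def is_sn_set(S, T):
    """True if no triplet xy|z in T has x, z in S and y outside S."""
    S = set(S)
    for pair, z in T:
        # xy|z and yx|z coincide, so a violation occurs exactly when z and
        # precisely one leaf of the pair {x, y} lie in S.
        if z in S and len(pair & S) == 1:
            return False
    return True

# Hypothetical dense triplet set of a caterpillar tree on the leaves 1..4:
# all triplets ab|c with a, b > c.
caterpillar = {(frozenset((2, 3)), 1), (frozenset((2, 4)), 1),
               (frozenset((3, 4)), 1), (frozenset((3, 4)), 2)}
checks = {name: is_sn_set(name, caterpillar)
          for name in [(3, 4), (2, 3, 4), (1, 2), (2, 3), (1,)]}
```

For this caterpillar, {3, 4} and {2, 3, 4} pass the test, whereas {2, 3} fails because of the triplet 34|2.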

Any two SN-sets w.r.t. a dense triplet set are either disjoint or one is included in the other [17, 
Lemma 8], which leads to the following definition. The SN-tree is a directed tree whose internal vertices
have outdegree greater than or equal to two, such that the SN-sets of T correspond exactly to the sets of leaves
reachable from a vertex of the SN-tree. It follows that there are at most 2(n − 1) nontrivial SN-sets
in a dense triplet set T. All these SN-sets can be found by constructing the SN-tree in O(n^3) time
[16]. If a network is consistent with a dense triplet set T, then the set of leaves S below any cut-arc is
always an SN-set, since triplets of the form xy|z with x, z ∈ S, y ∉ S, are not consistent with such a
network. Furthermore, each maximal SN-set is equal to the union of leaves below one or more highest 
cut-arcs [13, Lemma 5]. 
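
For small inputs, the SN-sets can also be enumerated naively by closing every pair of leaves under the SN-set condition. The Python sketch below does this; it is a brute-force illustration and not the O(n^3) SN-tree construction of [16]. The triplet encoding (frozenset({x, y}), z) and the names sn_closure and all_sn_sets are our own assumptions, and the sketch relies on the nestedness property above, which for dense sets implies that every non-singleton SN-set is the closure of some pair of its leaves.

```python
from itertools import combinations

def sn_closure(seed, T):
    """Smallest SN-set containing seed: keep adding the missing leaf y
    whenever some triplet xy|z in T has x, z inside and y outside."""
    S = set(seed)
    changed = True
    while changed:
        changed = False
        for pair, z in T:
            if z in S and len(pair & S) == 1:
                S |= pair
                changed = True
    return frozenset(S)

def all_sn_sets(T):
    """All singleton SN-sets plus the closures of all pairs of leaves."""
    leaves = {leaf for pair, z in T for leaf in (*pair, z)}
    sets = {frozenset([x]) for x in leaves}
    sets |= {sn_closure(p, T) for p in combinations(leaves, 2)}
    return sets

# Hypothetical dense triplet set of a caterpillar tree on the leaves 1..4.
caterpillar = {(frozenset((2, 3)), 1), (frozenset((2, 4)), 1),
               (frozenset((3, 4)), 1), (frozenset((3, 4)), 2)}
sn_sets = all_sn_sets(caterpillar)
```

On this example the result consists of the four singletons together with {3, 4}, {2, 3, 4} and {1, 2, 3, 4}, which are exactly the leaf sets below the vertices of the caterpillar.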

3 Constructing a Level-1 Network with a Minimum Number of 
Reticulations 

We propose the following dynamic programming algorithm for solving DMRL-1. The algorithm
considers all SN-sets from small to large and computes an optimal solution N_S for each SN-set S, based
on the optimal solutions for included SN-sets. The algorithm considers both the case where the root
of N_S is contained in a cycle and the case where there are two cut-arcs leaving the root. In the latter
case there are two SN-sets S_1 and S_2 that are maximal under the restriction that they are a strict subset of
S. If this is the case then the algorithm constructs a candidate for N_S by creating a root connected
to the roots of N_{S_1} and N_{S_2}.

The other possibility is that the root of N_S is contained in some cycle. For this case the algorithm
tries each SN-set as BHR(N_S): the set of leaves below the highest reticulation. The sets of leaves
below other highest cut-arcs can then be found using the property of optimal level-1 networks outlined
in Lemma 1. Subsequently, an induced set of triplets is computed, where each set of leaves below a
highest cut-arc is replaced by a single meta-leaf. A candidate network is constructed by computing
a simple level-1 network and replacing each meta-leaf S_i by an optimal network N_{S_i} for the
corresponding subset of the leaves. The optimal network N_S is then the network with a minimum number
of reticulations over all computed networks.

A structured description of the computations is in Algorithm 1. We use f(L') to denote the minimum
number of reticulations in any level-1 network consistent with T|L'. In addition, g(L', S') denotes the
minimum number of reticulations in any level-1 network N consistent with T|L' with BHR(N) = S'.
The algorithm first computes the optimal number of reticulations. Then a network with this number 
of reticulations is constructed using backtracking. 

To show that the algorithm indeed computes an optimal solution we need the following crucial property 
of optimal level-1 networks. 
Algorithm 1 MARLON (Minimum Amount of Reticulation Level One Network)

compute the set SN of SN-sets w.r.t. T
for i = 1, ..., n do
    for each S in SN of cardinality i do
        for each S' ∈ SN with S' ⊂ S do
            let C contain S' and all SN-sets that are maximal under the restriction that they
                are a subset of S and do not contain S'
            if T∇C is consistent with a simple level-1 network then
                g(S, S') := 1 + Σ_{X∈C} f(X)
        if there are exactly two SN-sets S_1, S_2 ∈ SN that are maximal under the restriction
                that they are a strict subset of S then
            g(S, ∅) := f(S_1) + f(S_2)    (C := {S_1, S_2})
        f(S) := min g(S, S') over all computed values of g(S, ·)
        store the optimal C and the corresponding simple level-1 network
construct an optimal network by backtracking
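
The step of Algorithm 1 that builds the collection C for a pair (S, S') can be sketched in Python as follows. The sketch makes one interpretive assumption that should be stated loudly: it reads "do not contain S'" as "are disjoint from S'", so that C partitions S. The function name candidate_partition and the example SN-sets are hypothetical.

```python
def candidate_partition(S, S_prime, sn_sets):
    """Return S' followed by the maximal SN-sets that are subsets of S and
    (by assumption) disjoint from S'. sn_sets is a collection of frozensets."""
    S, S_prime = frozenset(S), frozenset(S_prime)
    pool = [X for X in sn_sets if X <= S and not (X & S_prime)]
    # Keep only the sets of the pool that no other set of the pool contains.
    maximal = [X for X in pool if not any(X < Y for Y in pool)]
    return [S_prime] + sorted(maximal, key=sorted)

# Hypothetical SN-sets of a caterpillar tree on the leaves 1..4.
sn_sets = {frozenset(s) for s in
           [{1}, {2}, {3}, {4}, {3, 4}, {2, 3, 4}, {1, 2, 3, 4}]}
C_low = candidate_partition({1, 2, 3, 4}, {4}, sn_sets)
C_high = candidate_partition({1, 2, 3, 4}, {1}, sn_sets)
```

Choosing S' = {4} yields the four singletons, whereas S' = {1} yields only {1} and {2, 3, 4}. In both cases the sets in C are pairwise disjoint and together cover S, which is what the induced triplet set T∇C requires.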



Lemma 1. If there exists a solution to DMRL-1, then there also exists an optimal solution N, where 
the sets of leaves below highest cut-arcs equal either (i) BHR(N) and the SN-sets that are maximal 
under the restriction that they do not contain BHR(N), or (ii) the maximal SN-sets (if BHR(N) = ∅).

Proof. If BHR(N) = ∅ then there are two highest cut-arcs and the sets below them are the maximal
SN-sets. Otherwise, the root of N is part of a cycle. Let S be a maximal SN-set. We prove the following. 

Claim (1). Maximal SN-set S equals either the set of leaves below a highest cut-arc or the set of 
leaves below a directed path P ending in the highest reticulation or in one of its parents. 

Proof. If S equals the set of leaves below a single highest cut-arc then we are done. From now on
assume that S equals the set of leaves below different highest cut-arcs. First observe that no two
leaves in S have the root as their lowest common ancestor, since this would imply that all leaves are
in S, because S is an SN-set. It follows that all leaves in S are below some directed path P
on the highest cycle. First assume that not all leaves reachable from vertices in P are in S. Then
there are leaves x, z, y reachable respectively from vertices p_1, p_2, p_3 that are on P (in this order) with
x, y ∈ S and z ∉ S. But this leads to a contradiction because then the triplet xy|z is not consistent
with N, whilst yz|x and xz|y cannot be in T since S is an SN-set. It remains to prove that P ends in
either the highest reticulation or in one of its parents. Assume that this is not true; then there exists
a vertex v on (the interior of) a path from the last vertex of P to the highest reticulation. Consider
some leaf z ∉ S reachable from v and some leaves x, y ∈ S below different highest cut-arcs. Then this
again leads to a contradiction because xy|z is not consistent with N. This concludes the proof of the
claim. □

First suppose that a maximal SN-set S equals the set of leaves below a directed path P ending in a 
parent of the highest reticulation. In this case we can modify the network by putting S below a single 
cut-arc, without increasing the number of reticulations. To be precise, if p and p' are the first and 
last vertex of P respectively and r is the highest reticulation, then we subdivide the arc entering p 
by a new vertex v, add a new arc (v, r), remove the arc (p', r) and suppress the resulting vertex with
indegree and outdegree both equal to one. It is not too difficult to see that the resulting network is 
still consistent with T. 

Now suppose that some maximal SN-set S equals the set of leaves below a directed path P ending 
in the highest reticulation. The sets of leaves below highest cut-arcs are all SN-sets (as is always the 
case). One of them is equal to BHR(N). If any of the others is contained in a nontrivial SN-set S'
that does not contain BHR(N), then the procedure from the previous paragraph can again be used
to put S' below a highest cut-arc. In the resulting network the sets of leaves below highest cut-arcs 
are indeed equal to BHR(N) and the SN-sets that are maximal under the restriction that they do 
not contain BHR(N). 
Figure 3. Visualisation of the proof of Lemma 1. From the maximal SN-sets (encircled in the network on the
left) to the sets of leaves below highest cut-arcs (encircled in the network on the right). Remember that all 
arcs are directed downwards. 



An example is given in Figure 3. In the network on the left one maximal SN-set equals the set of 
leaves below the red path. In the middle is the same network, but now we encircled BHR(N) and the 
SN-sets that are maximal under the restriction that they do not contain BHR(N). There is still an 
SN-set (S') below a path on the cycle (again in red). However, in this case the network can be modified 
by putting S' below a single cut-arc, without increasing the number of reticulations. This gives the 
network to the right, where the sets of leaves below highest cut-arcs are indeed equal to BHR(N) 
and the SN-sets that are maximal under the restriction that they do not contain BHR(N). □ 

Theorem 1. Given a dense set of triplets T, algorithm MARLON constructs a level-1 network that
is consistent with T (if such a network exists) and has a minimum number of reticulations in O(n^5)
time.

Proof. The proof is by induction on the size i of S. Suppose that N is an optimal level-1 network
consistent with T|S. If BHR(N) = ∅ then the sets of leaves below highest cut-arcs are the two
maximal SN-sets S_1 and S_2. In this case f(S) can be computed by adding up f(S_1) and f(S_2).
Otherwise, it follows from Lemma 1 and the observation that BHR(N) has to be an SN-set, that
at some iteration the algorithm will consider the set C equal to the sets of leaves below the highest
cut-arcs of N. In this case the number of reticulations can be computed by adding one to the sum of
the values f(X) over all X ∈ C. This is because the network N consists of a (highest) cycle, connected
to optimal networks for the different X ∈ C. By induction, all values of f(X) for |X| < i have been
computed correctly and correctness of the algorithm follows. The number of SN-sets is O(n) because
any two SN-sets are either disjoint or one is included in the other [17, Lemma 8]. These SN-sets can
be found in O(n^3) time by computing the SN-tree [16]. Simple level-1 networks can be found in O(n^3)
time [16] and T∇C can be computed in O(n^3) time. These computations are repeated O(n^2) times:
for all S ∈ SN and all S' ∈ SN with S' ⊂ S. Therefore, the total running time is O(n^5). □
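
The running-time bound can be written out as a short calculation. The conversion to the input size |T| assumes only that a dense triplet set contains at least one and at most three triplets for every three leaves.

```latex
\underbrace{O(n^2)}_{\text{pairs } (S,\,S')} \cdot
\underbrace{O(n^3)}_{\text{work per pair}} = O(n^5),
\qquad
\binom{n}{3} \le |T| \le 3\binom{n}{3}
\;\Longrightarrow\; n = \Theta\bigl(|T|^{1/3}\bigr)
\;\Longrightarrow\; O(n^5) = O\bigl(|T|^{5/3}\bigr).
```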

4 Experiments 

MARLON has been implemented, tested and made publicly available [19]. For example the network 
in Figure 4 with 80 leaves and 13 reticulations could be constructed by MARLON in less than six 
minutes on a Pentium IV 3 GHz PC with 1 GB of RAM. 

To test the relevance of the constructed networks we applied MARLON to simulated data. The main 
advantage of using simulated data is that it enables us to compare output networks with the "real" 
network. We repeated the following experiment for different level-1 networks, which we in turn as- 
sumed to be the "real" network. Given such a level-1 network, we used the program Seq-Gen [22] to 
simulate sequences that could have evolved according to that network. We simulated a recombinant 
phylogeny by generating sequences of 4000 base pairs, consisting of two blocks of 2000 base pairs 
each. We assumed that each block evolved according to a phylogenetic tree. This means that in each 
simulation, our input to Seq-Gen consisted of two trees T_1 and T_2. For each reticulation of the level-1
network, T_1 uses just one of the incoming arcs and T_2 uses the other one. This makes sure that each
arc of the network is used by at least one of the two trees. Seq-Gen was used with the F81 model of
nucleotide substitution.

Figure 4. Example of a network constructed by MARLON.

From these simulated sequences we computed a set of triplets as follows. We assume that for one se- 
quence it is known that it is only distantly related to the others. This is called the outgroup sequence. 
For each combination of three sequences, plus the outgroup sequence, we computed a phylogenetic 
tree using the maximum likelihood method PHYML [8]. The output trees of PHYML give a dense 
triplet set, which we used as input to MARLON. 
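
Reading a triplet off a four-leaf tree that includes the outgroup can be sketched as follows. This is an illustration only: how the PHYML output was actually parsed is not described here, so the split representation and the name triplet_from_quartet are hypothetical, and unresolved (star) trees are not handled.

```python
def triplet_from_quartet(split, outgroup):
    """Given the split of an unrooted four-leaf tree as two pairs of leaves
    and the outgroup leaf, return the triplet obtained by rooting the tree
    at the outgroup, encoded as (frozenset({x, y}), z) for xy|z."""
    side_a, side_b = (frozenset(side) for side in split)
    if outgroup in side_a:
        with_outgroup, cherry = side_a, side_b
    else:
        with_outgroup, cherry = side_b, side_a
    # The leaf that shares its side with the outgroup becomes the leaf z.
    (z,) = with_outgroup - {outgroup}
    return (cherry, z)

t1 = triplet_from_quartet((("a", "b"), ("c", "out")), "out")
t2 = triplet_from_quartet((("a", "out"), ("b", "c")), "out")
```

Applying such a step to every combination of three sequences plus the outgroup yields one triplet per three taxa, hence a dense set.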

All simulations gave similar results. Here we describe the results for one specific "real" level-1 net- 
work, displayed in Figure 5. We obtained the simulated triplet set T* based on this network by the 
procedure described above. For this triplet set MARLON constructed the output network in Figure 6. 
The constructed network is very similar to the input network (which we assumed to be the "real" 
network). Both networks have four reticulations and also the branching structure is almost identical. 
All differences are of the following type. The output network contains some subnetworks
rooted below a parent of a reticulation. In some of these cases the input network is a bit different
because there the subnetwork is divided below a path on the cycle, ending in the parent of the
reticulation. For example, in Figure 5 the leaves 37, 38, 39, 40 are below a path on a cycle consisting of three
vertices. However, in the output network in Figure 6 these leaves are below a single vertex on the cycle.

Figure 5. The level-1 network on which the simulated triplet set T* is based.

Other simulations give similar results. The networks constructed by MARLON are almost identical to 
the input networks, except for some small differences that are almost all of the type described above. 
In one case the output network also contained an extra reticulation that was not present in the input 
network. In this case there must have been triplets in the simulated triplet set that were not consistent 
with the input network. 

We conclude that MARLON correctly constructs level-1 networks and works very fast. For
simulated data the produced networks are very close to the "real" networks used to generate the simulated
sequences. When using real data we expect the number of incorrect triplets to be larger and hence
the results possibly less impressive. In addition, real data sets will not always originate from a level-1
network, in which case MARLON will not be able to compute a solution. This problem will partly be
solved in the next section where we show how the approach can be extended to level-2. However, the
main conclusion to be drawn from the experiments is that, if the data is good enough, our method is
indeed able to produce good estimates of evolutionary histories. This shows, for example, that when
a set of triplets is computed from sequence data, sufficient information is retained to be able to
reconstruct the phylogenetic network accurately. In addition, MARLON provides a very fast method to
combine these triplets into a phylogenetic network.
Figure 6. The network constructed by MARLON for the simulated triplet set T*



5 Constructing a Level-2 Network with a Minimum Number of 
Reticulations 

This section extends the approach from Section 3 to level-2 networks. We describe how one can find 
a level-2 network consistent with a dense input triplet set containing a minimum number of reticula- 
tions, or decide that such a network does not exist. 

The general structure of the algorithm is the same as in the level-1 case. We loop through all
SN-sets S from small to large and compute an optimal solution N_S for that SN-set, based on previously
computed optimal solutions for included SN-sets. For each SN-set we still consider, as in the level-1
case, the possibility that there are two cut-arcs leaving the root of N_S and the possibility that this root
is in a biconnected component with one reticulation. However, now we also consider a third possibility,
that the root of N_S is in a biconnected component containing two reticulations.

In the construction of biconnected components with two reticulations, we use the notion of a
"non-cycle-reachable" arc, or n.c.r.-arc for short, introduced in [15]. We call an arc a = (u, v) an n.c.r.-arc
if v is not reachable from any vertex in a cycle. These n.c.r.-arcs will be used to combine networks 
without increasing the network level. In addition, we use the notion highest biconnected component to 
denote the biconnected component containing the root. 

Our complete algorithm is described in detail in Algorithm 2. To get an intuition of why the algorithm 
works, consider the four possible structures of a biconnected component containing two reticulations 
displayed in Figure 7. Let X, Y, Z and Q be the sets of leaves indicated in Figure 7 in the graph 
that displays the form of the highest biconnected component of N_S. Observe that after removing Z in
each case X, Y and Q become a set of leaves below a cut-arc and hence an SN-set (w.r.t. T|(S \ Z)).
In cases 2a, 2b and 2c the highest biconnected component becomes a cycle, Q the set of leaves below 
the highest reticulation and X and Y sets of leaves below highest cut-arcs. We will first describe the 
approach for these cases and show later how a similar technique is possible for case 2d. 
Figure 7. The four possible structures of a biconnected component containing two reticulations.



Our algorithm loops through all SN-sets that are a subset of S and will hence at some iteration 
consider the SN-set Z. The algorithm removes the set Z and computes the SN-sets w.r.t. T\(S \ Z). 
The sets of leaves below highest cut-arcs (in some optimal solution, if one exists) are now equal to 
X, Y, Q and the SN-sets that are maximal under the restriction that they do not contain X, Y or Q 
(by the same arguments as in the proof of Lemma 1). Therefore, the algorithm tries each possible 
SN-set for X, Y and Q and in one of these iterations it will correctly determine the sets of leaves 
below highest cut-arcs. Then the algorithm computes the induced set of triplets, where each set of 
leaves below a highest cut-arc is replaced by a single meta-leaf. All simple level-1 networks consistent
with this induced set of triplets are obtained by the algorithm in [16]. Our algorithm loops through
all these networks and does the following for each simple level-1 network N_1. Each meta-leaf V, not
equal to X or Y, is replaced by an optimal network N_V, which has been computed in a previous
iteration. To include leaves in Z, X and Y, we compute an optimal network N_2 consistent with
T|(X ∪ Z) and an optimal network N_3 consistent with T|(Y ∪ Z) where in both networks Z is the
set of leaves below an n.c.r.-arc. Then we combine these three networks into a single network as
in Figure 8. A new reticulation is created and Z becomes the set of leaves below this reticulation.
Finally, we check for each constructed network whether it is consistent with T|S. The network with
the minimum number of reticulations over all such networks is the optimal solution N_S for this SN-set.

Now consider case 2d. Suppose we remove Z and replace X, Y (= Q) and each SN-set w.r.t. T|(S \ Z)
that is maximal under the restriction that it does not contain X or Y by a single leaf. Then the 
resulting network consists of a path ending in a simple level-1 network, with X a child of the root and 
Q the child of the reticulation; and each vertex of the path has a leaf as child. Such a network can 
easily be constructed and subsequently one can use the same approach as in cases 2a, 2b and 2c. See 
Figure 9 for an example of the construction in case 2d. 

Theorem 2. Given a dense set of triplets T, Algorithm MARLTN constructs a level-2 network that
is consistent with T (if such a network exists) and has a minimum number of reticulations in O(n^9)
time.

Proof. Consider some SN-set S and assume that there exists an optimal solution N_S consistent with
T|S. The proof is by induction on the size of S. If the highest biconnected component of N_S contains
one reticulation then the algorithm constructs an optimal solution by the proof of Theorem 1. Hence
we assume from now on that the highest biconnected component of N_S contains two reticulations.

Consider the four graphs in Figure 7. Any biconnected component containing two reticulations is 
a subdivision of one of these graphs [13, Lemma 13]. These graphs are called simple level-2 generators 
in [13] and X, Y, Z and Q each label, in each generator, a side of the generator, i.e. either an arc or
a vertex with indegree 2 and outdegree 0. Suppose that the highest biconnected component of N_S is
a subdivision of generator G and let the set X (Y, Z, Q respectively) be defined as the set of leaves
reachable in N_S from a vertex with a parent on the path corresponding to the side labelled X (Y, Z,
Q respectively) in G.

To find an optimal network consistent with T|(X ∪ Z) (or T|(Y ∪ Z)) such that Z is below an
n.c.r.-arc we can use the following approach. If there are more than two maximal SN-sets then it is 
not possible. Otherwise, we create a root and connect it to two networks for the two maximal SN-sets. 
If one of these maximal SN-sets contains Z as a strict subset then we create a network for this set 
recursively. For other maximal SN-sets we use the optimal networks computed in earlier iterations. 

Given a network N' and a set of leaves L' below a cut-arc (u, v), we denote by N' \ L' the network
obtained by removing v and all vertices reachable from v from N', deleting all vertices with
outdegree zero and suppressing all vertices with indegree and outdegree both one.
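
The removal operation can be sketched on an arc list as follows. This is an illustration under stated assumptions rather than the authors' implementation: the network is a list of (tail, head) pairs, "vertices with outdegree zero" is read as non-leaf vertices that lose all their children (the surviving leaves are passed in explicitly so that they are kept), and the name `remove_below_cut_arc` is hypothetical.

```python
from collections import defaultdict

def remove_below_cut_arc(arcs, v, leaves):
    """Remove v and everything reachable from it, then tidy the network."""
    succ = defaultdict(set)
    for a, b in arcs:
        succ[a].add(b)
    # Collect v together with all vertices reachable from v.
    removed, stack = {v}, [v]
    while stack:
        x = stack.pop()
        for y in succ[x]:
            if y not in removed:
                removed.add(y)
                stack.append(y)
    arcs = [(a, b) for a, b in arcs if a not in removed and b not in removed]
    keep = set(leaves) - removed
    changed = True
    while changed:
        changed = False
        indeg, outdeg = defaultdict(int), defaultdict(int)
        for a, b in arcs:
            outdeg[a] += 1
            indeg[b] += 1
        for x in {w for arc in arcs for w in arc}:
            if outdeg[x] == 0 and x not in keep:
                # A former internal vertex that lost all children: delete it.
                arcs = [(a, b) for a, b in arcs if b != x]
                changed = True
                break
            if indeg[x] == 1 and outdeg[x] == 1:
                # Suppress x: join its parent directly to its child.
                (p,) = [a for a, b in arcs if b == x]
                (c,) = [b for a, b in arcs if a == x]
                arcs = [(a, b) for a, b in arcs if x not in (a, b)] + [(p, c)]
                changed = True
                break
    return arcs

tree = [("r", "u"), ("r", "z"), ("u", "x"), ("u", "y")]
print(sorted(remove_below_cut_arc(tree, "x", {"x", "y", "z"})))
```

In the small example, removing the leaf x leaves u with one parent and one child, so u is suppressed and y becomes a child of the root.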

Claim (2). There exists an optimal solution N' such that the sets of leaves below highest cut-arcs of 
N' \ Z are X, Y, Q and the SN-sets w.r.t. T|(S \ Z) that are maximal under the restriction that they
do not contain X, Y or Q. 

Proof. The highest biconnected component of N' \ Z contains just one reticulation and the same 
arguments can be used as in the proof of Lemma 1. □ 

Let N' be a network with the property described in the claim above and C the collection of sets of
leaves below highest cut-arcs of N' \ Z. At some iteration the algorithm will consider this set C. Let
T' equal T|(S \ Z). If we replace in N' \ Z each set of leaves below a highest cut-arc by a single leaf,
then we obtain a network consistent with T'∇C which is either a simple level-1 network or a path
ending in a simple level-1 network, with X a child of the root, Q the child of the reticulation, and
a leaf as child of each vertex of the path. The algorithm considers all networks of these types, so in
some iteration it will consider the right one. Let N* be the network constructed by the algorithm in
this iteration. It remains to prove that N* (i) is consistent with T|S, (ii) contains a minimum number
of reticulations and (iii) is a level-2 network.

To prove that N* is consistent with T|S, consider any triplet xy|z ∈ T|S. First suppose that x, y
and z are all in Z or all in the same set of C \ {X, Y}. Then x, y and z are elements of some SN-set S'
with |S'| < |S|. Triplet xy|z is consistent with the subnetwork N_{S'} by the induction hypothesis and
hence with N*.

Now suppose that x, y and z are all in X ∪ Z (or all in Y ∪ Z). Consider the construction of the
network consistent with T|(X ∪ Z) such that Z is below an n.c.r.-arc. First suppose that at some level
of the recursion there are two maximal SN-sets, each containing leaves from {x, y, z}. Then it follows
that x and y are in one maximal SN-set and z in the other one, by the definition of SN-set, and hence
that xy|z is consistent with the constructed network. Otherwise, x, y and z are all in some subnetwork
N_{S'} with |S'| < |S| and xy|z is consistent with this subnetwork (by the induction hypothesis) and
hence with N*.

Figure 9. Example of the construction of network N from N_1, N_2 and N_3 in case 2d.

Now consider any other triplet xy|z ∈ T|S, which thus contains leaves that are below at least two
different highest cut-arcs. Observe that the highest biconnected components of N* and N' are identical;
the only differences between these networks occur in the subnetworks below highest cut-arcs.
Therefore xy|z is consistent with N* since it is consistent with N'.

To show that N* contains a minimum number of reticulations consider any set S' of leaves below
a highest cut-arc a = (u, v) of N*. The subnetwork N_{S'} rooted at v contains a minimum number of
reticulations by the induction hypothesis. Hence N* contains at most as many reticulations as N',
which is an optimal solution.

In the networks N_2 and N_3, Z is the set of leaves below an n.c.r.-arc. This implies that none of
the potential reticulations in these networks end up in the highest biconnected component of N*.
Therefore, this biconnected component contains exactly two reticulations. All other biconnected
components of N* also contain at most two reticulations by the induction hypothesis. We thus conclude
that N* is a level-2 network.

To conclude the proof we analyse the running time of the algorithm. The number of SN-sets is O(n)
and hence there are O(n) choices for each of S, X, Y, Z and Q. For each combination of S, X, Y, Z and
Q there will be O(n) networks N* constructed and for each of them it takes O(n^3) time to check if it
is consistent with T|S (in line 23). Hence the overall time complexity is O(n^9). □

6 Constructing Networks Consistent with Precisely the Input Triplet Set 

In this section we consider the problem MIN-REFLECT-k. Given a triplet set T, this problem asks
for a level-k network N that is consistent with precisely those triplets in T (if such a network exists)
and has minimum level over all such networks. We will show that this problem is polynomial-time 
solvable for each fixed k. 

Algorithm 2 MARLTN (Minimum Amount of Reticulation Level Two Network)

 1: compute the set SN of SN-sets w.r.t. T
 2: for i = 1 ... n do
 3:   for each S in SN of cardinality i do
 4:     for each S' ∈ SN with S' ⊂ S do
 5:       let C contain S' and all SN-sets that are maximal under the restriction that they are a
          subset of S and do not contain S'
 6:       if T∇C is consistent with a simple level-1 network N_1 then
 7:         construct N* from N_1 by replacing each leaf V by an optimal network N_V constructed
            in a previous iteration
 8:         g(S, S') is the number of reticulations in N*
 9:     if there are exactly two SN-sets S_1, S_2 ∈ SN that are maximal under the restriction that
        they are a strict subset of S then
10:       N* consists of a root connected to the roots of optimal networks N_{S_1} and N_{S_2} that
          have been constructed in previous iterations
11:       g(S, ∅) is the number of reticulations in N*
12:     for each Z ∈ SN with Z ⊂ S do
13:       T' := T|(S \ Z)
14:       compute the set SN' of SN-sets w.r.t. T'
15:       for each X, Y, Q ∈ SN' do
16:         C is the collection consisting of X, Y, Q and all SN-sets in SN' that are maximal under
            the restriction that they do not include X, Y or Q
17:         construct an optimal network N_2 consistent with T|(X ∪ Z) such that Z is the set of
            leaves below an n.c.r.-arc (u, v)
18:         construct an optimal network N_3 consistent with T|(Y ∪ Z) such that Z is the set of
            leaves below an n.c.r.-arc (u', v')
19:         construct all simple level-1 networks consistent with T'∇C
20:         construct all networks consistent with T'∇C that consist of a path ending in a simple
            level-1 network, with X a child of the root, Q the child of the reticulation, and with a
            leaf below each internal vertex of the path
21:         for each network N_1 from the networks constructed in the above two lines do
22:           construct N* from N_1 by doing the following: replace X by N_2, Y by N_3 and each
              other leaf V by an optimal network N_V constructed in a previous iteration, then
              subdivide (u, v) into (u, w) and (w, v), delete everything below v' and add an arc (u', w)
23:           if N* is consistent with T|S then
24:             h(S, X, Y, Z, Q) is the number of reticulations in N*
25:     f(S) is the minimum of all computed values of g(S, S') and h(S, X, Y, Z, Q)
26:     store network N_S, which is a network N* attaining the minimum number f(S) of reticulations



Recall that we use T(N) to denote the set of all triplets consistent with a network N. Furthermore,
we say that a network N reflects a triplet set T if T(N) = T. The problem MIN-REFLECT-k thus
asks for a minimum level network N that reflects an input triplet set T, for some fixed upper bound
k on the level of N. Note that if N reflects T, then N is in general not uniquely defined by T. There
are, for example, several distinct simple level-2 networks that reflect the triplet set {xy|z, xz|y, zy|x}.

Note that MIN-REFLECT-0 can be easily solved by using the algorithm of Aho et al. [1]. This
follows because if a tree N is consistent with a dense set of triplets T, then N is unique [17] (and
T(N) = T).
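
For illustration, the tree-building step of Aho et al. can be sketched as below. This is a minimal version of the well-known recursive procedure, written under assumptions that are not part of the paper: a triplet xy|z is encoded as the tuple (x, y, z), an internal vertex is returned as a frozenset of its children, and the function name `build_tree` is hypothetical. For MIN-REFLECT-0 one would still have to check that the resulting tree reflects T.

```python
def build_tree(leaves, triplets):
    """Return a tree consistent with all triplets, or None if none exists.

    A leaf is returned as its label; an internal vertex is returned as a
    frozenset containing the subtrees of its children.
    """
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    # Union-find over the leaves: x and y are joined for every triplet
    # xy|z whose three leaves all belong to the current leaf set.
    parent = {x: x for x in leaves}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    inside = [t for t in triplets if set(t) <= leaves]
    for x, y, _z in inside:
        parent[find(x)] = find(y)
    components = {}
    for x in leaves:
        components.setdefault(find(x), set()).add(x)
    if len(components) < 2:
        return None  # a single component: the triplets cannot be separated
    children = []
    for comp in components.values():
        sub = build_tree(comp, inside)
        if sub is None:
            return None
        children.append(sub)
    return frozenset(children)

# ab|c yields the tree ((a, b), c); adding ac|b makes the set inconsistent.
print(build_tree("abc", [("a", "b", "c")]))
print(build_tree("abc", [("a", "b", "c"), ("a", "c", "b")]))
```

Using frozensets for internal vertices makes the result independent of the order in which components are visited.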

Theorem 3. Given a dense set of triplets T, it is possible to construct all simple level-k networks
consistent with T in time O(|T|^{k+1}).

Lemma 4. Let N be any simple network. Then all the nontrivial SN-sets of T(N) are singletons. 

As we will show, the above lemma allows us to solve the problem MIN-REFLECT-k by recursively
constructing simple level-k networks, which we can do by Theorem 3. This leads to the algorithm
MINPITS-k (MINimum level network consistent with Precisely the Input Triplet Set).

Theorem 4. Given a set of triplets T, Algorithm MINPITS-k solves MIN-REFLECT-k in time
O(|T|^{k+1}), for any fixed k.

6.1 Constructing all simple level-k networks in polynomial time

To prove Theorem 3 we first need several utility lemmas. Recall the concept of a simple level-k 
generator [13, Section 3.1]. (See also the proof of Theorem 2.) 

Lemma 2. Let G be a simple level-k generator. Then G contains O(k) vertices and O(k) arcs.

Proof. Suppose G contains s split vertices, k reticulation vertices and 1 root. Then the total indegree
equals s + 2k and the total outdegree is at least 2s + 2. Given that the total indegree equals the total
outdegree we get that s + 2k ≥ 2s + 2 and hence that s ≤ 2k - 2. So the total number of vertices is
at most 3k - 1. All the vertices have degree at most 3, so there are at most (9k - 3)/2 arcs. □

Lemma 3. Let N be a level-k network on n leaves, where k is fixed. Then N contains O(n) vertices
and O(n) arcs.

Proof. By the definition of a phylogenetic network we can view N as a rooted, directed "component
tree" B(N) of biconnected components where every internal vertex of B(N) represents a simple
level-≤k subnetwork of N (or a single vertex), and incident arcs of an internal vertex represent arcs incident
to the corresponding biconnected component. B(N) has n leaves, and every internal vertex has at least
two outgoing arcs. B(N) is a tree so it has at most n - 1 internal vertices and thus at most 2n - 1 vertices
in total, and at most 2n - 2 arcs. Each internal vertex represents a simple level-≤k generator with at
most (9k - 3)/2 arcs. Every outgoing arc raises the number of arcs (and vertices) inside the component
by at most 1. So the total number of arcs in N is bounded above by (2n - 2) + (n - 1)(9k - 3)/2 + (2n - 2),
which is O(n), and the total number of vertices by n + (n - 1)(3k - 1) + (2n - 2), also O(n). □
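
To make the constants in Lemmas 2 and 3 concrete, the bounds can be evaluated for given n and k. The sketch below simply transcribes the expressions from the two proofs; the function names are hypothetical, and non-integer bounds are rounded down, which is safe because vertex and arc counts are integers.

```python
def generator_bounds(k):
    """Upper bounds (vertices, arcs) for a simple level-k generator (Lemma 2)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    max_vertices = 3 * k - 1          # s + k + 1 with s <= 2k - 2
    max_arcs = (9 * k - 3) // 2       # every vertex has degree at most 3
    return max_vertices, max_arcs

def network_bounds(n, k):
    """Upper bounds (vertices, arcs) for a level-k network on n leaves (Lemma 3)."""
    if n < 1 or k < 1:
        raise ValueError("n and k must be at least 1")
    max_vertices = n + (n - 1) * (3 * k - 1) + (2 * n - 2)
    max_arcs = (2 * n - 2) + ((n - 1) * (9 * k - 3)) // 2 + (2 * n - 2)
    return max_vertices, max_arcs

print(generator_bounds(2))    # (5, 7)
print(network_bounds(4, 1))   # (16, 21)
```

Both bounds grow linearly in n for fixed k, which is all that the running-time analysis requires.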

Let N be a network with at least one reticulation vertex, and let v be the child of a reticulation
vertex in N. If v has no reticulation vertex as a descendant, then we call the subnetwork rooted at v a 
Tree hanging Below a Reticulation vertex (TBR). We additionally introduce the notion of the empty 
TBR, which corresponds to the situation when a reticulation vertex has no outgoing arcs. This cannot 
happen in a normal network but as explained shortly it will prove a useful abstraction. 

Observation 1 Every network N containing a reticulation vertex contains at least one TBR. 

Proof. Suppose this is not true. Let v be the child of a reticulation vertex in N with maximum distance
from the root. There must exist some vertex v' ≠ v which is a child of a reticulation vertex and which
is a descendant of v. But then the distance from the root to v' is greater than to v, contradiction. □
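
The definition of a TBR translates into a short search. The sketch below is illustrative only and makes assumptions beyond the text: the network is a list of (tail, head) arcs, a reticulation is any vertex of indegree at least 2, and a child of a reticulation that is itself a reticulation is not treated as the root of a TBR. The name `find_tbr_roots` is hypothetical.

```python
from collections import defaultdict

def find_tbr_roots(arcs):
    """Return the roots of all TBRs of the network given by `arcs`."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    for a, b in arcs:
        succ[a].append(b)
        indeg[b] += 1
    reticulations = {v for v, d in indeg.items() if d >= 2}

    def has_reticulation_below(v):
        # Depth-first search over the strict descendants of v.
        seen, stack = set(), list(succ[v])
        while stack:
            x = stack.pop()
            if x in reticulations:
                return True
            if x not in seen:
                seen.add(x)
                stack.extend(succ[x])
        return False

    roots = set()
    for r in reticulations:
        for v in succ[r]:
            if v not in reticulations and not has_reticulation_below(v):
                roots.add(v)
    return roots

# A level-1 network whose single reticulation h has the leaf w as its child.
arcs = [("r", "s"), ("r", "z"), ("s", "p"), ("s", "q"),
        ("p", "h"), ("q", "h"), ("p", "x"), ("q", "y"), ("h", "w")]
print(find_tbr_roots(arcs))  # {'w'}
```

A tree has no reticulation vertices, so the function returns the empty set for it, in line with Observation 1 applying only to networks that contain a reticulation.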

Note that, because a TBR is (as a consequence of its definition) below a cut-arc, there exists an SN-set
S w.r.t. T such that T|S is consistent with (only) the TBR. We call an SN-set S such that T|S is consistent
with a tree a CandidateTBR SN-Set. Every TBR of N corresponds to some CandidateTBR
SN-Set of T, but the converse is not necessarily true. For example, a singleton SN-set is a
CandidateTBR SN-Set, but it might not be the child of a reticulation vertex in N.

We abuse definitions slightly by defining the empty CandidateTBR SN-Set, which will correspond 
to the empty TBR. (This is abusive because the empty set is not an SN-set.) Furthermore we define 
that every triplet set T has an empty CandidateTBR SN-Set. 

Observation 2 Let T be a dense set of triplets on n leaves. There are at most O(n) CandidateTBR
SN-sets. All such sets, and the tree that each such set represents, can be found in total time O(n^3).

Proof. First we construct the SN-tree for T; this takes time O(n^3). There is a bijection between the
SN-sets of T and the vertices of the SN-tree. (In the SN-tree, the children of an SN-set S are the
maximal SN-sets of T|S.) Observe that a vertex of the SN-tree is a CandidateTBR SN-set if and only
if it is a singleton SN-set or it has in total two children, and both are CandidateTBR SN-sets. We can
thus use depth first search to construct all the CandidateTBR SN-sets; note that this is also sufficient
to obtain the trees that the CandidateTBR SN-sets represent, because (for trees) the structure of the
tree is identical to the nesting structure of its SN-sets. Given that there are only O(n) SN-sets, the
running time is dominated by construction of the SN-tree. □
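
The depth first search from this proof can be sketched as follows, assuming the SN-tree has already been built. The representation is an assumption made for illustration: the SN-tree is a dictionary mapping each SN-set (a frozenset of leaves) to the list of its children, and the tree represented by a CandidateTBR SN-set is returned as nested pairs. The name `candidate_tbr_sn_sets` is hypothetical, and the empty CandidateTBR SN-set is not included in the output.

```python
def candidate_tbr_sn_sets(children, root):
    """Map every CandidateTBR SN-set below `root` to the tree it represents."""
    candidates = {}

    def visit(s):
        kids = children.get(s, [])
        # Post-order traversal: all children are visited before s is decided.
        subtrees = [visit(c) for c in kids]
        if len(s) == 1:
            tree = next(iter(s))              # a singleton is a one-leaf tree
        elif len(kids) == 2 and all(t is not None for t in subtrees):
            tree = (subtrees[0], subtrees[1])  # two candidate children
        else:
            tree = None
        if tree is not None:
            candidates[s] = tree
        return tree

    visit(root)
    return candidates

fs = frozenset
sn_tree = {
    fs("abcd"): [fs("ab"), fs("c"), fs("d")],  # three children: not a candidate
    fs("ab"): [fs("a"), fs("b")],
}
result = candidate_tbr_sn_sets(sn_tree, fs("abcd"))
print(result[fs("ab")])  # ('a', 'b')
```

Each SN-set is visited once, so the search itself is linear in the number of SN-sets, matching the remark that the construction of the SN-tree dominates the running time.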

Theorem 3. Given a dense set of triplets T, it is possible to construct all simple level-k networks
consistent with T in time O(|T|^{k+1}).

Proof. We claim that algorithm SL-k does this for us. First we prove correctness. The high-level idea
is as follows. Consider a simple level-k network N. From Observation 1 we know that N contains
at least one TBR. (Given that N is simple we know that all TBRs are equal to single leaves. That
is why the outermost loop of the algorithm can restrict itself to considering only single-leaf TBRs.)
By looping through all CandidateTBR SN-sets we will eventually find one that corresponds to a real 
TBR. If we remove this TBR and the reticulation vertex from which it hangs, and then suppress any 
resulting vertices with both indegree and outdegree equal to 1, we obtain a new network (not neces- 
sarily simple) with one fewer reticulation vertex than N . Note that this new network might not be a 
"real" network in the sense that it might have reticulation vertices with no outgoing arcs. Repeating 
this k times in total we eventually reach a tree which we can construct using the algorithm of Aho et 
al. (and is unique, as shown in [17]). From this tree we can reconstruct the network N by reintroducing 
the TBRs back into the network (each TBR below a reticulation vertex) in the reverse order to which 
we found them. We do not, however, know exactly where the reticulation vertices were in N, so every
time we reintroduce a TBR into the network we exhaustively try every pair of arcs (as the arcs
that will be subdivided to hang the reticulation vertex, and thus the TBR, from). Because we try
every possible way of removing TBRs from the network N, and every possible way of adding them 
back, we will eventually reconstruct N. 

The role of the dummy leaves in SL-k is linked to the empty TBRs (and their corresponding empty
CandidateTBR SN-Sets.) When a TBR is removed, it can happen (as mentioned above) that a net- 
work is created containing reticulation vertices with no outgoing arcs. (For example: when one of the 
parents of a reticulation vertex from which the TBR hangs, is also a reticulation vertex.) Conceptually 
we say that there is a TBR hanging below such a reticulation vertex, but that it is empty. Hence 
the need in the algorithm to also consider removing the empty TBR. If this happens, we will also 
encounter the phenomenon in the second phase of the algorithm, when we are re-introducing TBRs 
into the network. What do we insert into the network when we reintroduce an empty TBR? We use a 
dummy leaf as a place-holder, ensuring that every reticulation vertex always has an outgoing arc. The 
dummy leaves can be removed once that outgoing arc is subdivided later in the algorithm, or at the 
end of the algorithm, whichever happens sooner. The dummy root, finally, is needed for when there 
are no leaves on a side (see [13, Section 3.1]) connected to the root. 

We now analyse the running time. For k ∈ {1, 2} we can actually generate all simple level-1 networks
in time O(n^3) using the algorithm in [16], and all simple level-2 networks in time O(n^8) using
the algorithm in [13]. For k ≥ 3 we use SL-k. From Observation 2 we know that each execution of
FindCandidateTBRs (which computes all TBRs in a dense triplet set plus the empty TBR) takes
O(n^3) time and returns at most O(n) TBRs. Operations such as computing T'_i, and the construction
of the tree N'_k, all require time bounded above by O(n^3). The for loops when we "guess" the TBRs
are nested to a depth of k. The for loops when we "guess" pairs of arcs from which to hang TBRs
are also nested to a depth of k. (There will only be O(n) arcs to choose from.) Checking whether N'
is consistent with T, which we do inside the innermost loop of the entire algorithm, takes time O(n^3)
[6, Lemma 2]. So the running time is O(n(n^3 + n(n^3 + ... + n(n^3 + n^{2k+3}) ...))), which is O(n^{3k+3}). □

Algorithm 3 SL-k (Construct all Simple Level-k networks)

 1: Net := ∅
 2: TBR_1 := L(T)
 3: for each leaf b_1 ∈ TBR_1 do
 4:   L'_1 := L(T) \ {b_1}
 5:   T'_1 := T|L'_1
 6:   TBR_2 := FindCandidateTBRs(T'_1)
 7:   for each b_2 ∈ TBR_2 do
 8:     ...
        { Continue nesting for loops to a depth of k. }
 9:     ...
10:     TBR_k := FindCandidateTBRs(T'_{k-1})
11:     for each b_k ∈ TBR_k do
12:       L'_k := L'_{k-1} \ b_k
13:       T'_k := T'_{k-1}|L'_k
          { At this point we have finished "guessing" where the TBRs are, }
          { and ({b_1}, b_2, ..., b_k) is a vector of (possibly empty) subsets of L(T). }
          { We now "guess" all possible ways of hanging the TBRs back in. }
14:       if L'_k contains 2 or more leaves then
15:         build the unique tree N'_k = (V, A) consistent with T'_k if it exists (see [17])
16:       else
17:         If L'_k contains 1 leaf {x}, let N'_k be the network comprising the single leaf {x}
18:         If L'_k contains 0 leaves, let N'_k be the network comprising a single, new dummy leaf
19:       V := V ∪ {r'}; A := A ∪ {(r', r)} { with r the root of N'_k and r' a new dummy root }
20:       Let H(b_k) be the unique tree consistent with b_k
          { Note: H(b_k) is a single vertex if |b_k| = 1 and empty if |b_k| = 0. }
21:       for every two arcs a_k^1, a_k^2 in N'_k (not necessarily distinct) do
22:         Let p (respectively q) be a new vertex obtained by subdividing a_k^1 (respectively a_k^2)
23:         Connect p and q to a new reticulation vertex ret_k
24:         Hang H(b_k) (or a new dummy leaf if H(b_k) is empty) from ret_k
25:         if a_k^1 (or a_k^2) was the arc above a dummy leaf d then
26:           Remove d and if its former parent has indegree and outdegree 1, suppress that
27:         Let N'_{k-1} be the resulting network
28:         Let H(b_{k-1}) be the unique tree consistent with b_{k-1}
29:         for every two arcs a_{k-1}^1, a_{k-1}^2 in N'_{k-1} (not necessarily distinct) do
30:           ...
              { Continue nesting for loops to a depth of k. }
31:           ...
32:           Let N'_1 be the resulting network
33:           Let H(b_1) be the tree consisting of only the single vertex b_1
34:           for every two arcs a_1^1, a_1^2 in N'_1 (not necessarily distinct) do
35:             Let p (respectively q) be a new vertex obtained by subdividing a_1^1 (respectively a_1^2)
36:             Connect p and q to a new reticulation vertex ret_1
37:             Hang H(b_1) from ret_1
38:             if a_1^1 (or a_1^2) was the arc above a dummy leaf d then
39:               Remove d and if its former parent has indegree and outdegree 1, suppress that
                { This is the innermost loop of the algorithm. }
40:             Let N' be the resulting network
41:             Remove the dummy root r' from N'
42:             Remove (and if needed suppress former parents of) any remaining dummy leaves in N'
43:             if N' is a simple level-k network consistent with T then
44:               Net := Net ∪ {N'}
45: return Net

Corollary 1. For fixed k and a triplet set T it is possible to generate in time O(n^{3k+3}) all simple
level-k networks N that reflect T.

Proof. The algorithm SL-k (or, for that matter, the algorithms from [16] and [13]) can easily be adapted
for this purpose: in line 43 we change "network consistent with T" to "network that reflects T". The
running time is unchanged because, whether we are checking consistency or reflection, the implementation
of [6, Lemma 2] implicitly generates T(N'). □



6.2 From simple networks that reflect, to general networks that reflect

For a triplet t = xy|z and a network N, we define an embedding of t in N as any set of four paths
(q → x, q → y, p → q, p → z) which, except for their endpoints, are mutually vertex disjoint, and
where p ≠ q. We say that the vertex p is the summit of the embedding. Clearly, t is consistent with
N if and only if there is at least one embedding of t in N.
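
In the special case where N is a tree, an embedding is easy to find, because the only candidates are q = lca(x, y) and p = lca(q, z). The sketch below covers only this tree case and is not the general consistency check of [6, Lemma 2]; the tree is assumed to be given as a child-to-parent dictionary, and the function names are hypothetical.

```python
def ancestors(parent, v):
    """Vertices on the path from v up to the root, starting with v itself."""
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path

def lca(parent, a, b):
    """Lowest common ancestor of a and b in the tree."""
    above_a = set(ancestors(parent, a))
    for v in ancestors(parent, b):
        if v in above_a:
            return v
    raise ValueError("vertices are not in the same tree")

def tree_displays_triplet(parent, x, y, z):
    """True if the tree contains an embedding of the triplet xy|z."""
    q = lca(parent, x, y)   # where the paths to x and y split
    p = lca(parent, q, z)   # summit: where the path to z splits off
    return p != q           # the embedding requires p and q to differ

# The tree ((x, y), z): u is the parent of x and y, r is the root.
parent = {"x": "u", "y": "u", "u": "r", "z": "r"}
print(tree_displays_triplet(parent, "x", "y", "z"))  # True
print(tree_displays_triplet(parent, "x", "z", "y"))  # False
```

In a network the four paths can be routed in several ways, which is why the general case needs the disjointness condition stated in the definition.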

Lemma 4. Let N be any simple network. Then all the nontrivial SN-sets of T(N) are singletons. 

Proof. We prove the lemma by contradiction. Assume thus that there is some SN-set S of T(N) such
that 1 < |S| < |L(N)|.

Let r be the root of N. An in-out root embedding is an embedding of any triplet xz|y with x, z ∈ S
and y ∉ S that has r as its summit. We begin by proving that an in-out root embedding exists in N.
Suppose by contradiction this is not true. The triplet xz|y must be in T(N) because T(N) is dense
and y ∉ S. Consider then a triplet embedding (q → x, q → z, p → q, p → y) where p ≠ r, x, z ∈ S
and y ∉ S. We assume without loss of generality that this embedding minimises (amongst all such
embeddings) the length of the shortest directed path from r to p. Let P be any shortest directed path
from r to p. Now, suppose some directed path Q begins somewhere on the path P, and intersects
with the path p → y. This is not possible because it would contradict the minimality of the length
of P. For the same reason, Q may not intersect with the path p → q. Q may also not intersect with
(without loss of generality) the path q → x because this would mean y ∈ S. We conclude that such
a path Q either terminates at a leaf l, or re-intersects with the path P. It cannot terminate at a leaf
l because then we either violate the minimality of the length of P (because we obtain an embedding
of xz|l that has a summit closer to the root than p), or we have that y ∈ S. We conclude that all
outgoing paths from P must re-intersect with P. However, given that P includes the root, and that
in a directed acyclic graph every vertex is reachable by a directed path from the root, it follows that
the last arc on the path P must be a cut-arc. But this violates the biconnectivity of N, contradiction.
We conclude that there exists at least one in-out root embedding in N.

Let (q → x, q → z, r → q, r → y) be any in-out root embedding. We observe that the path r → y
must contain at least one internal vertex, by biconnectivity. Also, at least one of q → x and q → z
must contain an internal vertex, because it is not possible for a vertex in a simple network to have
two leaf children.

We now argue that there must exist a twist cover of the path r → q. This is defined as a non-empty
set C of undirected paths (undirected in the sense that not all arcs need to have the same orientation)
where (i) all paths in C are arc-disjoint from the in-out root embedding, (ii) exactly one path starts
at an internal vertex s of (without loss of generality) q → z, (iii) exactly one path ends at an internal
vertex t of r → y, (iv) all other start and endpoints of the paths in C lie on r → q and (v) for every
vertex v of the path r → q (including r and q), there is at least one path in C that has its startpoint
to the left of v, and its endpoint to the right. Property (v) is crucial because it says (informally) that
every vertex on r → q is "covered" by some path that begins and ends on either side of it and is
arc-disjoint from the embedding. The length of C is defined to be the sum of the number of arcs in
each path in C.

Suppose however that a twist cover does not exist. We define a partial twist cover as one that
satisfies all properties of a twist cover except property (v). Partial twist covers have thus at least one
vertex on r → q that is not covered. (To see that there always exists at least one partial twist cover
note that properties (ii) and (iii) in particular are satisfied by the fact that neither the removal of q nor
of r can be allowed to disconnect N.) So let C be the partial twist cover with the minimum number of
uncovered vertices. Let d be the uncovered vertex that is closest to q. If we removed d we would, by
definition, disconnect the union of the paths in C with the in-out root embedding into a left part G
and a right part H. The vertex d does not, however, disconnect N, so there must be some path P not
in C that begins somewhere in G and ends somewhere in H. If P has its startpoint on a path X ∈ C
(where X will be in G) and/or an endpoint on a path Y (where Y will be in H) then these paths
can be "merged" into a new path that strictly increases the number of vertices covered. The merging
occurs as follows. We take the union of the arcs in P with those in X and/or Y and discard superfluous
arcs until we obtain a path that covers a strict superset of the union of the vertices covered by X
and/or Y. (In particular, the fact that P begins in G and ends in H means that the vertex d becomes
covered.) In this way we obtain a new partial twist cover with fewer uncovered vertices, contradiction.
If P has both its startpoint and endpoint on vertices of r → q that are not on paths in C, then P
can be added to the set C and this extends the number of covered vertices, contradiction. If P begins
and/or ends elsewhere on the embedding then P can be added to C which again increases the number
of vertices covered, contradiction. (If P begins on, without loss of generality, q → z then it becomes
the new property-(ii) path and the old property-(ii) path should be discarded. Symmetrically, if P
ends on r → y then it becomes the new property-(iii) path and the old property-(iii) path should be
discarded.)

We conclude that for every in-out root embedding there thus exists a twist cover, and in particular 
a minimum length twist cover. 

We observe that a minimum-length twist cover C has a highly regular, interleaved structure. This
regularity follows because it cannot contain paths that completely contain other paths (simply discard
the inner path) and if two paths X, Y ∈ C have startpoints that are both covered by some other path
Z ∈ C, and (without loss of generality) X reaches further right than Y, then we can simply discard
Y. For similar reasons minimum-length twist covers are vertex- and arc-disjoint. In Figure 10 we show
several simple examples of twist covers exhibiting this regular structure (although it should be noted
that minimum-length twist covers can contain arbitrarily many paths).

Let C be the twist cover of minimum length ranging over all in-out root embeddings of xz|y
where x, z ∈ S and y ∉ S. Note that if C contains exactly one path, which is a directed path, then
(irrespective of the path orientation) y ∈ S, contradiction. The high-level idea is to show that we can
always find, by "walking" along the paths in C, a new in-out root embedding and twist cover that
is shorter than C, yielding a contradiction. Let X be the path in C that begins at s, and consider
the arc on this path incident to s. The first subcase is if this arc is directed away from s. Continuing
along the path we will eventually encounter an arc with opposite orientation. This must occur before
any intersection of X with the path r → q because otherwise there would be a directed cycle. Let v
be the vertex between these two arcs. Then v is a reticulation vertex with indegree 2 and outdegree 1, so
there must exist a directed path Q leaving v which eventually reaches a leaf m. If Q intersects with
r → y then we have that y ∈ S, contradiction. If Q intersects with either q → x or q → z then we
obtain a new in-out root embedding of xz|y and a new twist cover for that embedding that is shorter
than C, contradiction. If Q does not intersect with the embedding at all, then it must be true that
m ∈ S (because the triplet mz|x is in the network). But then we have an in-out root embedding of
the triplet zm|y with twist cover shorter than C, contradiction.

The second subcase (see Figure 11) is when the first arc of X is entering s. There must exist some
directed path R from r to s that uses this arc. The fact that r is the summit of the embedding means
that at some point this directed path departs from the embedding. Let w be the vertex where R
departs from the embedding for the last time. If w is on the path r → y then it follows that y ∈ S, a
contradiction. If w is on one of the paths q → x or q → z then this leads to a new in-out root embedding
with a shorter twist cover, a contradiction. The last case is when w lies on the path r → q. We create
a new in-out root embedding by using the part of R reachable from w as an alternative route to z.
In this way w becomes the "q" vertex of the new embedding (denoted q' in the figure). To see that
we also obtain a new twist cover, note principally that paths in C that covered w become legitimate
candidates for property-(ii) paths in the new twist cover; in the figure s' denotes the beginning of the
property-(ii) path in the new cover. (Such a path can however partly overlap with R. In this case it is
necessary to first remove the part that overlaps with R.) We can furthermore discard all paths from
C that covered w except for the one with endpoint furthest to the right. Even if this means that no
paths from C are discarded (this happens when w is to the left of all the paths in C that have their
beginning points on r → q) we still get a twist cover at least one edge smaller than C, because (in
particular) the first edge of X is no longer needed in C. In any case, we obtain a contradiction. □




Figure 10. Several examples of twist covers (the red, undirected paths) from the proof of Lemma 4. Note
that these exhibit the regular, interleaved structure associated with minimum-length twist covers.




Figure 11. The case in the proof of Lemma 4 where the arc incident to s is incoming.



Corollary 2. Let T be a set of triplets, and suppose there exists a simple network N that reflects T.
Let N' be any network that reflects T. Then N' is also simple.

Proof. If N' is not simple then it contains a cut-arc below which at least two leaves can be found.
We know [13, Lemma 3] that every cut-arc of a network consistent with T defines an SN-set w.r.t. T
equal to the set of leaves below it. But, by Lemma 4, all the SN-sets of T are singletons, a contradiction. □

Let T be a reflective set of triplets and N be a network that reflects T. Define Collapse(N) as the
network obtained by, for each highest cut-arc a = (u, v), replacing v and everything below it by a single
new leaf V, which we identify with the set of leaves below a. Let 𝒱 be the leaf set of Collapse(N).
We define a new set of triplets T' on the leaf set 𝒱 as follows: XY|Z ∈ T' if and only if there exist
x ∈ X, y ∈ Y and z ∈ Z such that xy|z ∈ T. We write T' = CutInduce(N,T) as shorthand for the
above.
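
The definition of the induced triplet set can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a triplet xy|z is stored as a tuple (x, y, z) and that the leaf sets below the highest cut-arcs are supplied directly as a partition `blocks` of the leaves (rather than being read off a network N); the function name `cut_induce` and this representation are hypothetical and not taken from the paper.

```python
def cut_induce(triplets, blocks):
    """Sketch of the induced triplet set T'.

    triplets: iterable of (x, y, z) tuples, each meaning the triplet xy|z.
    blocks:   iterable of leaf sets partitioning the leaves (assumed input).
    Returns a set of (frozenset({X, Y}), Z) pairs, each meaning XY|Z.
    """
    # Map every leaf to the block (collapsed leaf) that contains it.
    block_of = {}
    for block in blocks:
        fb = frozenset(block)
        for leaf in fb:
            block_of[leaf] = fb

    induced = set()
    for x, y, z in triplets:
        X, Y, Z = block_of[x], block_of[y], block_of[z]
        # Only triplets whose leaves lie in three different blocks
        # give rise to a triplet on the collapsed leaf set.
        if X != Y and X != Z and Y != Z:
            # The cherry XY is unordered, so store it as a frozenset.
            induced.add((frozenset({X, Y}), Z))
    return induced
```

Scanning T once suffices here because XY|Z belongs to T' exactly when at least one triplet xy|z of T has x, y and z in X, Y and Z respectively.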

Observation 3. Let T be a reflective set of triplets, and let N be some network that reflects T. Then
(1) T is dense, (2) T' = CutInduce(N,T) is also reflective, and Collapse(N) reflects T', and (3) the
maximal SN-sets of T', which are in 1:1 correspondence with the maximal SN-sets of T, are all singletons.
Algorithm 4 MINPITS-k (MINimum level network consistent with Precisely the Input Triplet Set)
 1: N := ∅
 2: compute the set SN of maximal SN-sets of T
 3: if |SN| = 2 then
 4:     N consists of a root connected to two leaves: the elements of SN
 5: else
 6:     if there exists a simple level-≤k network that reflects T∇SN then
 7:         let N be such a network of minimum level
 8: for each leaf V of N do
 9:     recursively create a level-k network N_V of minimal level that reflects T|V
10: if N ≠ ∅ and all N_V ≠ ∅ then
11:     replace each leaf V of N by the recursively created N_V
12:     return N
13: else
14:     return ∅



Proof. The proof of (1) is trivial. For (2), observe firstly that by [13, Lemma 11.2] (the entire lemma
generalises easily to higher level networks) the network Collapse(N) is simple and consistent with T'.
Suppose however that there is some triplet XY|Z ∈ T(Collapse(N)) that is not in T'. We know then
that for any x ∈ X, y ∈ Y, z ∈ Z the triplet xy|z is in T(N), and thus also in T, which would
mean that XY|Z is in T', a contradiction. For (3) we argue as follows. By [13, Lemma 11.3] it follows that there is
a 1:1 correspondence between the maximal SN-sets of T and those of T'. Combining the fact that T' is
reflective and that Collapse(N) is a simple network that reflects T', we can conclude from Lemma 4
that the maximal SN-sets of T' are all singletons. □

Lemma 5. Let T be a reflective set of triplets and let SN be the maximal SN-sets of T. Let T' =
T∇SN be the set of triplets induced by SN. Let N' be a simple network of minimum level that reflects
T'. Then replacing each leaf V of N' by a network that reflects T|V and is of minimum level amongst
such networks, yields a network N that reflects T and is of minimum level.

Proof. We first prove that N reflects T. Let N⁰ be any network that reflects T. Every highest cut-arc
of N⁰ corresponds exactly to a maximal SN-set w.r.t. T. Hence T' = T∇SN = CutInduce(N⁰,T).
Note also that for a maximal SN-set S ∈ SN, the set of triplets T|S is reflective: the subnetwork of
N⁰ below the cut-arc corresponding to S reflects T|S. (This ensures that the recursive step does find
some network.) We know from Theorem 3 of [13] that the network N is at least consistent with T.
But suppose there exists some triplet xy|z ∈ T(N) that is not in T. By construction it cannot be the case
that x, y, z are all below the same highest cut-arc in N (equivalently: in the same maximal SN-set
w.r.t. T). If exactly two of the leaves are below the same highest cut-arc, then we see immediately
that xy|z is in T = T(N⁰). Suppose then that all three leaves are from different maximal SN-sets X,
Y, Z respectively. Then XY|Z ∈ T' because T' = T(N'). So we can conclude the existence of some
x' ∈ X, y' ∈ Y, z' ∈ Z such that x'y'|z' ∈ T. But it then also follows that xy|z ∈ T. We now prove
that N is of minimal level. Observe that all networks N⁰ that reflect T have the same set of highest
cut-arcs, in the sense that each highest cut-arc always corresponds to exactly one maximal SN-set
w.r.t. T. Given that the subnetworks created for the T|V are of minimal level, and that the level of N
is equal to the maximum level ranging over N' and all subnetworks below highest cut-arcs, it follows
that N is of minimum level if and only if N' is of minimum level. □
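
The recursive construction described in Lemma 5 and Algorithm MINPITS-k can be sketched in Python as follows. This is only a skeleton under stated assumptions: the three callables `maximal_sn_sets`, `induce` and `find_simple_network` are hypothetical stand-ins for, respectively, the computation of the maximal SN-sets, the computation of the induced triplet set, and a search for a simple network of minimum level (for example via SL-1, ..., SL-k). The nested dictionary used to describe a network and the explicit base case for a single leaf are likewise illustrative choices and do not appear in the paper.

```python
def restrict(triplets, leaves):
    """Return T|V: the triplets whose three leaves all lie in `leaves`."""
    return [t for t in triplets if all(leaf in leaves for leaf in t)]


def minpits(triplets, leaves, maximal_sn_sets, induce, find_simple_network):
    """Skeleton of the recursion; returns a nested description or None.

    maximal_sn_sets(triplets, leaves) -> list of leaf sets (hypothetical);
        it is assumed to return a proper partition of `leaves`.
    induce(triplets, blocks)          -> induced triplet set (hypothetical).
    find_simple_network(t, blocks)    -> simple network of minimum level
        reflecting t, or None if none exists (hypothetical).
    """
    leaves = frozenset(leaves)
    if len(leaves) == 1:
        # Assumed base case: a single leaf is its own network.
        return next(iter(leaves))

    blocks = [frozenset(b) for b in maximal_sn_sets(triplets, leaves)]
    if len(blocks) == 2:
        # A root connected to two leaves: the two maximal SN-sets.
        top = ("root", tuple(blocks))
    else:
        top = find_simple_network(induce(triplets, blocks), blocks)
        if top is None:
            return None

    # Recurse on every maximal SN-set, i.e. on every leaf V of the top network.
    below = {}
    for block in blocks:
        sub = minpits(restrict(triplets, block), block,
                      maximal_sn_sets, induce, find_simple_network)
        if sub is None:
            return None
        below[block] = sub

    # "Replacing each leaf V by N_V" is recorded as a mapping V -> N_V.
    return {"top": top, "below": below}
```

Passing the three subroutines as parameters keeps the skeleton independent of how SN-sets and simple networks are actually computed, which is where all the real work of the algorithm lies.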

Theorem 4. For fixed k we can solve MIN-REFLECT-k in time O(|T|^{k+1}).

Proof. From Observation 3 we know that there is no solution if T is not dense, and this can be checked
in O(|T|) time. We henceforth assume that T is dense. For k = 0 we can simply use the algorithm of
Aho et al., which (with an advanced implementation [17]) can be implemented to run in time O(n^3),
which is O(|T|). For k ≥ 1 we use algorithm MINPITS-k. Correctness of the algorithm follows from
Lemma 5. It remains to analyze the running time. A simple level-k network (that reflects the input)
can be found (if it exists) in time O(n^{3k+3}) using algorithm SL-k. (To find the simple network of
minimum level we execute in order SL-1, SL-2, ..., SL-k until we find such a network. This adds a
multiplicative factor of k to the running time but this is absorbed by the O(·) notation for fixed k.)
Therefore, lines 6 and 7 of MINPITS-k take O(|SN|^{3k+3}) time. At every level of the recursion the
computation of the maximal SN-sets of T can be done in time O(n^3), and computation of T∇SN also
takes O(n^3). The critical observation is that (by Observation 3) every SN-set in T appears exactly once
as a leaf inside an execution of SL-k. The overall running time is thus of the form O(Σ_i (n^3 + s_i^{3k+3})),
where s_i is the number of leaves in the i-th execution of SL-k, so that Σ_i s_i is equal to the total
number of SN-sets in T. Noting that Σ_i s_i^{3k+3} ≤ (Σ_i s_i)^{3k+3}, and
that there are at most O(n) SN-sets in T, we obtain for k ≥ 1 an overall running time of O(n^{3k+3}),
which is O(|T|^{k+1}) because T is dense. □
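
The density check used at the start of the proof can be illustrated with a short Python sketch. It assumes, for illustration only, that triplets are stored as tuples (x, y, z) meaning xy|z; the function name `is_dense` is hypothetical. Density is taken in the sense used throughout: T contains at least one triplet for each combination of three taxa.

```python
from math import comb


def is_dense(triplets, leaves):
    """Return True if every 3-subset of `leaves` carries at least one triplet.

    triplets: iterable of (x, y, z) tuples, each meaning the triplet xy|z.
    leaves:   iterable of leaf labels.
    """
    leaves = set(leaves)
    covered = set()
    for x, y, z in triplets:
        # Ignore malformed entries and triplets that use unknown leaves.
        if len({x, y, z}) == 3 and {x, y, z} <= leaves:
            covered.add(frozenset((x, y, z)))
    # T is dense exactly when all C(n, 3) leaf combinations are covered.
    return len(covered) == comb(len(leaves), 3)
```

Counting the distinct covered leaf combinations, instead of enumerating all of them, needs a single pass over T, which matches the O(|T|) bound stated above if set operations are assumed to take constant expected time.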

7 Conclusions and open questions 

In this article we have shown that, for level 1 and 2, constructing a phylogenetic network consistent
with a dense set of triplets that minimises the number of reticulations (i.e. DMRL-1 and DMRL-2) is
polynomial-time solvable. We feel that, given the widespread use of the principle of parsimony within
phylogenetics, this is an important development, and testing on simulated data has yielded promising
results. However, the complexity of finding a feasible solution for level 3 and higher, let alone a
minimum solution, remains unknown, and this obviously requires attention. Perhaps the feasibility and
minimisation variants diverge in complexity for high enough k; it would be fascinating to explore this.

We have also shown, for every fixed k, how to generate in polynomial time all simple level-k
networks consistent with a dense set of triplets. This could be an important step towards determining
whether the aforementioned feasibility question is tractable for every fixed k. We have used the SL-k
algorithm to show how, for every fixed k, MIN-REFLECT-k is polynomial-time solvable. Clearly the
demand that a set of triplets is exactly equal to the set of triplets in some network is an extremely
strong restriction on the input. However, for small networks and high-accuracy triplets such an
assumption might indeed be valid, and thus of practical use. In any case, the concept of reflection is
likely to have a role in future work on "support" for edges in phylogenetic networks generated via
triplets. Also, there remain some fundamental questions to be answered about reflection. For example,
the complexity of the question "does any network N reflect T?" remains unclear.